---
layout: post
title: "Kaggle Happy whale Data Exploratory Analysis"
subtitle: "Kaggle ipython notebook export test"
background: '/img/posts/happy-whale/HappyWhale.jpeg'
---
In this notebook, we'll use the timm library's pretrained models to generate image embeddings, specifically the tf_efficientnet_b0 model. Images are resized to 224x224, and each embedding is a 1280-dimensional vector (the output of the pooling layer of tf_efficientnet_b0). We'll then explore a key question of this competition: can new_individual be identified from image embeddings? So, follow me, dare to face the unknown, and ponder the question: What if? ;)

Here are some of my notebooks for this competition, please upvote if you find them useful
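Before diving in, here is the core idea of embedding-based re-identification as a minimal numpy sketch (the `identify` helper, the 4-dim "embeddings", and the threshold value are all illustrative, not part of the competition pipeline): match a query embedding against a gallery by cosine similarity, and predict `new_individual` when even the best match is below a threshold.

```python
import numpy as np

def identify(query, gallery, ids, threshold=0.6):
    """Return the best-matching id, or 'new_individual' if the
    best cosine similarity falls below `threshold`."""
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = g @ q                    # cosine similarity to every gallery image
    best = int(np.argmax(sims))
    return ids[best] if sims[best] >= threshold else 'new_individual'

# toy 4-dim "embeddings" for two known individuals
gallery = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0]])
ids = ['cadddb1636b9', '1a71fbb72250']

print(identify(np.array([0.9, 0.1, 0.0, 0.0]), gallery, ids))  # close to the 1st individual
print(identify(np.array([0.0, 0.0, 1.0, 0.0]), gallery, ids))  # matches nothing well
```

The same logic scales to the real 1280-dim embeddings we generate below; only the gallery size and the threshold tuning change.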
!pip install -qU timm wandb imagesize
import os
from glob import glob
from tqdm.notebook import tqdm
import numpy as np
import math
import random
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import cv2
import imagesize
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
import timm
try:
    from cuml import TSNE, UMAP  # if GPU is on
except ImportError:
    from sklearn.manifold import TSNE  # for CPU (note: no UMAP fallback here)
import wandb
import IPython.display as ipd
class CFG:
    seed = 42
    base_path = 'input/happy-whale-and-dolphin'
    embed_path = 'input/happywhale-embedding-dataset'  # `None` for creating embeddings, otherwise load
    ckpt_path = 'input/arcface-gem-dataset/Loss15.2453_epoch3.bin'  # checkpoint for model finetuned by debarshichanda
    num_samples = None  # None for all samples
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    competition = 'happywhale'
    _wandb_kernel = 'awsaf49'
def seed_torch(seed_value):
    random.seed(seed_value)  # Python
    np.random.seed(seed_value)  # cpu vars
    torch.manual_seed(seed_value)  # cpu vars
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)  # gpu vars
    if torch.backends.cudnn.is_available():  # was missing the call parentheses
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    print('# SEEDING DONE')

seed_torch(CFG.seed)
# SEEDING DONE
Weights & Biases (W&B) is an MLOps platform for tracking our experiments. We can use it to build better models faster with experiment tracking, dataset versioning, and model management. Some of the cool features of W&B:
import wandb
try:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    api_key = user_secrets.get_secret("WANDB")
    wandb.login(key=api_key)
    anonymous = None
except Exception:
    anonymous = "must"
    wandb.login(anonymous=anonymous)
print('To use your W&B account,\nGo to Add-ons -> Secrets and provide your W&B access token. Use the Label name as WANDB. \nGet your W&B access token from here: https://wandb.ai/authorize')
wandb: Currently logged in as: afterscool (use `wandb login --relogin` to force relogin)
To use your W&B account, Go to Add-ons -> Secrets and provide your W&B access token. Use the Label name as WANDB. Get your W&B access token from here: https://wandb.ai/authorize
- train_images/ - a folder containing the training images
- train.csv - provides the species and the individual_id for each of the training images
- test_images/ - a folder containing the test images; for each image, your task is to predict the individual_id; no species information is given for the test data; there are individuals in the test data that are not observed in the training data, which should be predicted as new_individual
- sample_submission.csv - a sample submission file in the correct format

Note: We don't have access to the species column for the test data. So, we can't directly use species for training.
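Since the metric is MAP@5, each test image gets up to five space-separated guesses, and deciding where to slot in new_individual is part of the strategy. A tiny illustrative sketch (the `make_prediction` helper and the ids are made up, not competition starter code):

```python
def make_prediction(candidate_ids, new_rank=1):
    """Join top guesses into the submission format, inserting
    'new_individual' at rank `new_rank` (illustrative helper)."""
    preds = list(candidate_ids[:4])
    preds.insert(new_rank, 'new_individual')
    return ' '.join(preds[:5])

row = make_prediction(['id1', 'id2', 'id3', 'id4'])
print(row)  # 'id1 new_individual id2 id3 id4'
```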
df = pd.read_csv(f'{CFG.base_path}/train.csv')
df['image_path'] = CFG.base_path+'/train_images/'+df['image']
df['split'] = 'Train'
test_df = pd.read_csv(f'{CFG.base_path}/sample_submission.csv')
test_df['image_path'] = CFG.base_path+'/test_images/'+test_df['image']
test_df['split'] = 'Test'
print('Train Images: {:,} | Test Images: {:,}'.format(len(df), len(test_df)))
Train Images: 51,033 | Test Images: 27,956
The dataset is huge. We can control its size using CFG.num_samples.
if CFG.num_samples:
    df = df.iloc[:CFG.num_samples]
    test_df = test_df.iloc[:CFG.num_samples]
In the following cells, we map beluga and globis to their whale species so that a clean two-class (whale/dolphin) label can be derived.

# convert beluga, globis to whales
df.loc[df.species.str.contains('beluga'), 'species'] = 'beluga_whale'
df.loc[df.species.str.contains('globis'), 'species'] = 'short_finned_pilot_whale'
df.loc[df.species.str.contains('pilot_whale'), 'species'] = 'short_finned_pilot_whale'
df['class'] = df.species.map(lambda x: 'whale' if 'whale' in x else 'dolphin')
# fix duplicate labels
# https://www.kaggle.com/c/happy-whale-and-dolphin/discussion/304633
df['species'] = df['species'].str.replace('bottlenose_dolpin','bottlenose_dolphin')
df['species'] = df['species'].str.replace('kiler_whale','killer_whale')
def get_imgsize(row):
    row['width'], row['height'] = imagesize.get(row['image_path'])
    return row
# Train
tqdm.pandas(desc='Train ')
df = df.progress_apply(get_imgsize, axis=1)
df.to_csv('train.csv', index=False)
# Test
tqdm.pandas(desc='Test ')
test_df = test_df.progress_apply(get_imgsize, axis=1)
test_df.to_csv('test.csv',index=False)
print('Train:')
display(df.head(2))
print('Test:')
display(test_df.head(2))
Train:
| image | species | individual_id | image_path | split | class | width | height | |
|---|---|---|---|---|---|---|---|---|
| 0 | 00021adfb725ed.jpg | melon_headed_whale | cadddb1636b9 | input/happy-whale-and-dolphin/train_images/000... | Train | whale | 804 | 671 |
| 1 | 000562241d384d.jpg | humpback_whale | 1a71fbb72250 | input/happy-whale-and-dolphin/train_images/000... | Train | whale | 3504 | 2336 |
Test:
| image | predictions | image_path | split | width | height | |
|---|---|---|---|---|---|---|
| 0 | 000110707af0ba.jpg | 37c7aba965a5 114207cab555 a6e325d8e924 19fbb96... | input/happy-whale-and-dolphin/test_images/0001... | Test | 3599 | 2399 |
| 1 | 0006287ec424cb.jpg | 37c7aba965a5 114207cab555 a6e325d8e924 19fbb96... | input/happy-whale-and-dolphin/test_images/0006... | Test | 3600 | 2400 |
It seems we also have a class imbalance across the different species. We may want to split our data stratified by species.
data = df.species.value_counts().reset_index()
fig = px.bar(data, x='index', y='species', color='species',title='Species', text_auto=True)
fig.update_traces(textfont_size=12, textangle=0, textposition="outside", cliponaxis=False)
fig.show()
data = df['class'].value_counts().reset_index()
fig = px.bar(data, x='index', y='class', color='class', title='Whale Vs Dolphin', text_auto=True)
fig.update_traces(textfont_size=12, textangle=0, textposition="outside", cliponaxis=False)
fig.show()
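Given the imbalance above, a species-stratified split keeps rare species represented in both folds. A minimal pandas sketch of the idea (the `stratified_split` helper and the toy frame are illustrative, assuming a pandas version with `GroupBy.sample`, i.e. >= 1.1):

```python
import pandas as pd

def stratified_split(frame, by, frac=0.8, seed=42):
    """Sample `frac` of every group for train; the rest is validation."""
    train = frame.groupby(by, group_keys=False).sample(frac=frac, random_state=seed)
    valid = frame.drop(train.index)
    return train, valid

toy = pd.DataFrame({'species': ['beluga_whale'] * 10 + ['dusky_dolphin'] * 90})
tr, va = stratified_split(toy, 'species')
print(tr.species.value_counts().to_dict())  # ~80% of each species ends up in train
```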
fig = px.histogram(df,
                   x="width",
                   color="class",
                   barmode='group',
                   log_y=True,
                   title='Width Vs Class')
fig.show()

fig = px.histogram(df,
                   x="height",
                   color="class",
                   barmode='group',
                   log_y=True,
                   title='Height Vs Class')
fig.show()
fig = px.histogram(pd.concat([df, test_df]),
                   x="width",
                   color="split",
                   barmode='group',
                   log_y=True,
                   title='Width Vs Split')
fig.show()

fig = px.histogram(pd.concat([df, test_df]),
                   x="height",
                   color="split",
                   barmode='group',
                   log_y=True,
                   title='Height Vs Split')
fig.show()
def load_image(path):
    img = cv2.imread(path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return img
class ImageDataset(Dataset):
    def __init__(self,
                 path,
                 target=None,
                 input_shape=(128, 256),
                 transform=None,
                 channel_first=True,
                 ):
        super(ImageDataset, self).__init__()
        self.path = path
        self.target = target
        self.input_shape = input_shape
        self.transform = transform
        self.channel_first = channel_first

    def __len__(self):
        return len(self.path)

    def __getitem__(self, idx):
        img = load_image(self.path[idx])
        img = cv2.resize(img, dsize=self.input_shape)
        if self.transform is not None:
            img = self.transform(image=img)["image"]
        if self.channel_first:
            img = img.transpose((2, 0, 1))
        if self.target is not None:
            target = self.target[idx]
            return img, target
        else:
            return img
def get_dataset(path, target=None, batch_size=32, input_shape=(224, 224)):
    dataset = ImageDataset(path=path,
                           target=target,
                           input_shape=input_shape,
                           )
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=2,
        shuffle=False,
        pin_memory=True,
    )
    return dataloader
def plot_batch(batch, row=2, col=2, channel_first=True):
    if isinstance(batch, (tuple, list)):
        imgs, tars = batch
    else:
        imgs, tars = batch, None
    plt.figure(figsize=(col * 3, row * 3))
    for i in range(row * col):
        plt.subplot(row, col, i + 1)
        img = imgs[i].numpy()
        if channel_first:
            img = img.transpose((1, 2, 0))
        plt.imshow(img)
        if tars is not None:
            plt.title(tars[i])
        plt.axis('off')
    plt.tight_layout()
    plt.show()
def gen_colors(n=10):
    cmap = plt.get_cmap('rainbow')
    colors = [cmap(i) for i in np.linspace(0, 1, n + 2)]
    colors = [(c[2] * 255, c[1] * 255, c[0] * 255) for c in colors]  # RGBA floats -> 0-255 tuples
    return colors
We need to create dataloaders to read images efficiently.
train_loader = get_dataset(path=df.image_path.tolist(),
                           target=df.species.tolist(),
                           input_shape=(224, 224),
                           )
test_loader = get_dataset(path=test_df.image_path.tolist(),
                          target=None,
                          input_shape=(224, 224),
                          )
Let's have a look at some images from Train Data
batch = next(iter(train_loader))
plot_batch(batch, row=2, col=5)
Let's have a look at some images from Test Data
batch = next(iter(test_loader))
plot_batch(batch, row=2, col=5)
class ImageModel(nn.Module):
    def __init__(self, backbone_name, pretrained=True):
        super(ImageModel, self).__init__()
        self.backbone = timm.create_model(backbone_name,
                                          pretrained=pretrained)
        self.backbone.reset_classifier(0)  # to get pooled features

    def forward(self, x):
        x = self.backbone(x)
        return x
class GeM(nn.Module):
    def __init__(self, p=3, eps=1e-6):
        super(GeM, self).__init__()
        self.p = nn.Parameter(torch.ones(1) * p)
        self.eps = eps

    def forward(self, x):
        return self.gem(x, p=self.p, eps=self.eps)

    def gem(self, x, p=3, eps=1e-6):
        return F.avg_pool2d(x.clamp(min=eps).pow(p), (x.size(-2), x.size(-1))).pow(1. / p)

    def __repr__(self):
        return self.__class__.__name__ + \
            '(' + 'p=' + '{:.4f}'.format(self.p.data.tolist()[0]) + \
            ', ' + 'eps=' + str(self.eps) + ')'
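GeM (generalized-mean) pooling interpolates between average pooling (p=1) and max pooling (p -> infinity), and the learnable p lets the network pick the best point in between. A numpy sketch of the same formula (the `gem` function and toy feature map below are illustrative, not the module above):

```python
import numpy as np

def gem(x, p=3.0, eps=1e-6):
    """Generalized-mean pool over the spatial axes of a (C, H, W) map."""
    return np.mean(np.clip(x, eps, None) ** p, axis=(1, 2)) ** (1.0 / p)

x = np.array([[[1.0, 2.0], [3.0, 4.0]]])  # one channel, 2x2 feature map
print(gem(x, p=1.0))    # == spatial mean: [2.5]
print(gem(x, p=100.0))  # approaches the spatial max of 4.0
```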
class FTModel(nn.Module):
    """FineTune (on happywhale dataset) Model"""
    def __init__(self, model_name, pretrained=True):
        super(FTModel, self).__init__()
        self.model = timm.create_model(model_name, pretrained=pretrained)
        in_features = self.model.classifier.in_features
        self.model.classifier = nn.Identity()
        self.model.global_pool = nn.Identity()
        self.pooling = GeM()
        self.fc = nn.Identity()

    def forward(self, images):
        features = self.model(images)
        pooled_features = self.pooling(features).flatten(1)
        return pooled_features
def load_model(ckpt_path):
    model = FTModel(model_name='tf_efficientnet_b0', pretrained=False)
    model.load_state_dict(torch.load(ckpt_path), strict=False)
    model.fc = nn.Identity()
    return model
model1 = ImageModel('tf_efficientnet_b0')
model2 = load_model(CFG.ckpt_path)

@torch.no_grad()
def predict(model, dataloader):
    model.eval()  # turn off layers such as BatchNorm or Dropout
    model.to(CFG.device)  # cpu -> gpu
    embeds = []
    pbar = tqdm(dataloader, total=len(dataloader))
    for img in pbar:
        img = img.type(torch.float32)  # uint8 -> float32
        img = img.to(CFG.device)  # cpu -> gpu
        embed = model(img)  # this is where the magic happens ;)
        gpu_mem = torch.cuda.memory_reserved() / 1E9 if torch.cuda.is_available() else 0
        pbar.set_postfix(gpu_mem=f'{gpu_mem:0.2f} GB')
        embeds.append(embed.cpu().detach().numpy())
    return np.concatenate(embeds)
train_loader = get_dataset(
    path=df.image_path.tolist(),
    target=None,
    input_shape=(224, 224),
    batch_size=128 * 4,
)
test_loader = get_dataset(
    path=test_df.image_path.tolist(),
    target=None,
    input_shape=(224, 224),
    batch_size=128 * 4,
)
if CFG.embed_path:
    print('# Load Train Embeddings:')
    train_embeds = np.load(f'{CFG.embed_path}/train_embeds.npy')
    print('# Load Test Embeddings:')
    test_embeds = np.load(f'{CFG.embed_path}/test_embeds.npy')
    print('# Load Train Embeddings (Finetune):')
    train_embeds2 = np.load(f'{CFG.embed_path}/train_embeds2.npy')
    print('# Test Embeddings (Finetune):')
    test_embeds2 = np.load(f'{CFG.embed_path}/test_embeds2.npy')
else:
    print('# Train Embeddings:')
    train_embeds = predict(model1, train_loader)
    print('# Test Embeddings:')
    test_embeds = predict(model1, test_loader)
    print('# Train Embeddings (Finetune):')
    train_embeds2 = predict(model2, train_loader)
    print('# Test Embeddings (Finetune):')
    test_embeds2 = predict(model2, test_loader)

# Save Embeddings
np.save('train_embeds.npy', train_embeds)
np.save('test_embeds.npy', test_embeds)
np.save('train_embeds2.npy', train_embeds2)
np.save('test_embeds2.npy', test_embeds2)  # fixed filename (was 'test_embed2.npy')
# Load Train Embeddings:
# Load Test Embeddings:
# Load Train Embeddings (Finetune):
# Test Embeddings (Finetune):

tsne = TSNE()

# Concatenate both train and test
embeds = np.concatenate([train_embeds, test_embeds])
embeds2 = np.concatenate([train_embeds2, test_embeds2])

# Fit T-SNE on the embeddings, then transform the data
tsne_embed = tsne.fit_transform(embeds)
tsne_embed2 = tsne.fit_transform(embeds2)
# Train
df['x'] = tsne_embed[:len(train_embeds),0]
df['y'] = tsne_embed[:len(train_embeds),1]
df['x2'] = tsne_embed2[:len(train_embeds2),0]
df['y2'] = tsne_embed2[:len(train_embeds2),1]
# Test
test_df['x'] = tsne_embed[len(train_embeds):,0]
test_df['y'] = tsne_embed[len(train_embeds):,1]
test_df['x2'] = tsne_embed2[len(train_embeds2):,0]
test_df['y2'] = tsne_embed2[len(train_embeds2):,1]
umap = UMAP()

# Fit UMAP on the embeddings, then transform the data
umap_embed = umap.fit_transform(embeds)
umap_embed2 = umap.fit_transform(embeds2)
# Train
df['x3'] = umap_embed[:len(train_embeds),0]
df['y3'] = umap_embed[:len(train_embeds),1]
df['x4'] = umap_embed2[:len(train_embeds2),0]
df['y4'] = umap_embed2[:len(train_embeds2),1]
# Test
test_df['x3'] = umap_embed[len(train_embeds):,0]
test_df['y3'] = umap_embed[len(train_embeds):,1]
test_df['x4'] = umap_embed2[len(train_embeds2):,0]
test_df['y4'] = umap_embed2[len(train_embeds2):,1]
NameError: name 'UMAP' is not defined
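The NameError above happens because UMAP is only imported from cuML, so there is no CPU fallback. One hedged fix (assuming the umap-learn package, installable via `pip install umap-learn`) is a tiered import:

```python
try:
    from cuml import UMAP        # GPU build (RAPIDS)
except ImportError:
    try:
        from umap import UMAP    # CPU build: pip install umap-learn
    except ImportError:
        UMAP = None              # no UMAP available; skip the UMAP plots

if UMAP is None:
    print('UMAP unavailable; only T-SNE projections will be plotted')
```

Guarding the UMAP plots with `if UMAP is not None:` would then avoid the AttributeError on the x3/x4 columns later on.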
There are two ways to log an embedding plot to W&B: compute the 2D projection ourselves and log it as a plot, or log the raw data and let W&B build the 2D projection. In this notebook, we'll be using the 2nd way, but you are encouraged to try both.

# convert config from class to dict
config = {k:v for k,v in dict(vars(CFG)).items() if '__' not in k}
# initialize wandb project
wandb.init(project='happywhale-public', config=config)
# process data for wandb
wdf1 = pd.concat([df, test_df]).drop(columns=['image_path','predictions']) # train + test
wdf2 = df.copy() # only train as some columns of test don't have any value e.g: species
# log the data
wandb.log({"All":wdf1,
"Train":wdf2}) # log both result
# save embeddings to wandb for later use
wandb.save('train_embeds.npy')  # save train embeddings
wandb.save('test_embeds.npy')  # save test embeddings
wandb.save('train_embeds2.npy')  # save finetuned train embeddings
wandb.save('test_embeds2.npy')  # save finetuned test embeddings
# show wandb dashboard
display(ipd.IFrame(wandb.run.url, width=1080, height=720)) # show wandb dashboard
# finish logging
wandb.finish()
After logging, the W&B output directory (./wandb/run-20220225_015420-38j77d0b/logs) will look like this:
And the embedding plot will look something like this:
Phew! We have come so far. Let's visualize the image embeddings using T-SNE.
x_min = df.x.min()
x_max = df.x.max()
y_min = df.y.min()
y_max = df.y.max()
def plot(df, ROW, COL):
    plt.figure(figsize=(15, 16 * ROW / COL))
    for k in range(ROW):
        for j in range(COL):
            plt.subplot(ROW, COL, k * COL + j + 1)
            row = df.iloc[k * COL + j]
            img = cv2.imread(row.image_path)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            plt.axis('off')
            id_ = row['image']
            try:
                species = row['species']
                class_ = row['class']
            except KeyError:  # test rows have no species/class
                species = None
                class_ = None
            plt.title('id:{}\nclass:{}\nspecies:{}'.format(id_, class_, species))
            plt.imshow(img)
    plt.tight_layout()
    plt.show()
def plot_tsne(df1, df2, labels=['Train', 'Test'], colors=['orange', 'blue']):
    plt.figure(figsize=(10, 10))
    plt.scatter(df1.x, df1.y, color=colors[0], s=10, label=labels[0])
    plt.scatter(df2.x, df2.y, color=colors[1], s=10, label=labels[1], alpha=0.4)
    # draw the currently selected window (xa_mx, xb_mx, ya_mx, yb_mx are globals)
    plt.plot([xa_mx, xa_mx], [ya_mx, yb_mx], color='black')
    plt.plot([xa_mx, xb_mx], [ya_mx, ya_mx], color='black')
    plt.plot([xb_mx, xb_mx], [ya_mx, yb_mx], color='black')
    plt.plot([xa_mx, xb_mx], [yb_mx, yb_mx], color='black')
    plt.legend()
    plt.show()
From the plots below, it seems new_individual can't be identified from the image embeddings alone; the new_individual class looks very similar to old_individual and has a similar distribution.

plt.figure(figsize=(15, 15))
plt.subplot(2, 2, 1)
plt.scatter(df.x,df.y,color='orange',s=10,label='Train')
plt.scatter(test_df.x,test_df.y,color='blue',s=10,label='Test', alpha=0.5)
plt.title('T-SNE')
plt.legend(prop={'size': 12})
plt.subplot(2, 2, 2)
plt.scatter(df.x2,df.y2,color='orange',s=10,label='Train')
plt.scatter(test_df.x2,test_df.y2,color='blue',s=10,label='Test', alpha=0.5)
plt.title('T-SNE (Finetune)')
plt.legend(prop={'size': 12})
plt.subplot(2, 2, 3)
plt.scatter(df.x3,df.y3,color='red',s=10,label='Train')
plt.scatter(test_df.x3,test_df.y3,color='green',s=10,label='Test', alpha=0.5)
plt.title('UMAP')
plt.legend(prop={'size': 12})
plt.subplot(2, 2, 4)
plt.scatter(df.x4,df.y4,color='red',s=10,label='Train')
plt.scatter(test_df.x4,test_df.y4,color='green',s=10,label='Test', alpha=0.5)
plt.title('UMAP (Finetune)')
plt.legend(prop={'size': 12})
plt.tight_layout()
plt.show()
AttributeError: 'DataFrame' object has no attribute 'x3'
Let's plot some train and test images.
The window size is controlled by X_DIV and Y_DIV; the bigger the value, the smaller the window.

ROW = 2
COL = 5
X_DIV = 20
Y_DIV = 20
x_step = (x_max - x_min) / X_DIV
y_step = (y_max - y_min) / Y_DIV
for it in range(5):
    i = 0; i2 = 0; trial = 0
    while i < ROW * COL:
        trial += 1
        if trial > 50:
            break
        k = np.random.randint(0, X_DIV)
        j = np.random.randint(0, Y_DIV)
        xa_mx = k * x_step + x_min
        xb_mx = (k + 1) * x_step + x_min
        ya_mx = j * y_step + y_min
        yb_mx = (j + 1) * y_step + y_min
        df1 = df.loc[(df.x > xa_mx) & (df.x < xb_mx) & (df.y > ya_mx) & (df.y < yb_mx)]
        df2 = test_df.loc[(test_df.x > xa_mx) & (test_df.x < xb_mx) & (test_df.y > ya_mx) & (test_df.y < yb_mx)]
        i = len(df1)
        i2 = len(df2)
    print(f'### RANDOM: {it}')
    print('>>TSNE:')
    plot_tsne(df, test_df)
    print('>>Train:')
    if i >= ROW * COL:
        plot(df1, ROW, COL)
    else:
        print('Not Found')
    print('>>Test')
    if i2 >= ROW * COL:
        plot(df2, ROW, COL)
    else:
        print('Not Found')
    print('\n\n')
### RANDOM: 0 >>TSNE:
>>Train:
>>Test
### RANDOM: 1 >>TSNE:
>>Train:
>>Test
### RANDOM: 2 >>TSNE:
>>Train:
>>Test
### RANDOM: 3 >>TSNE:
>>Train:
>>Test
### RANDOM: 4 >>TSNE:
>>Train:
>>Test
Let's plot the image embeddings of Whales and Dolphins using T-SNE.
w_df = df[df['class']=='whale']
d_df = df[df['class']=='dolphin']
plt.figure(figsize=(15,15))
plt.subplot(2, 2, 1)
plt.scatter(w_df.x,w_df.y,color='orange',s=10,label='Whale')
plt.scatter(d_df.x,d_df.y,color='blue',s=10,label='Dolphin', alpha=0.4)
plt.title('T-SNE')
plt.legend(prop={'size': 12})
plt.subplot(2, 2, 2)
plt.scatter(w_df.x2,w_df.y2,color='orange',s=10,label='Whale')
plt.scatter(d_df.x2,d_df.y2,color='blue',s=10,label='Dolphin', alpha=0.4)
plt.title('T-SNE (Finetune)')
plt.legend(prop={'size': 12})
plt.subplot(2, 2, 3)
plt.scatter(w_df.x3,w_df.y3,color='red',s=10,label='Whale')
plt.scatter(d_df.x3,d_df.y3,color='green',s=10,label='Dolphin', alpha=0.4)
plt.title('UMAP')
plt.legend(prop={'size': 12})
plt.subplot(2, 2, 4)
plt.scatter(w_df.x4,w_df.y4,color='red',s=10,label='Whale')
plt.scatter(d_df.x4,d_df.y4,color='green',s=10,label='Dolphin', alpha=0.4)
plt.title('UMAP (Finetune)')
plt.legend(prop={'size': 12})
plt.tight_layout()
plt.show()
AttributeError: 'DataFrame' object has no attribute 'x3'
Let's plot some images from Whales and Dolphins. It seems that they look alike.
ROW = 2
COL = 5
X_DIV = 15
Y_DIV = 15
x_step = (x_max - x_min) / X_DIV
y_step = (y_max - y_min) / Y_DIV
w_df = df[df['class'] == 'whale']
d_df = df[df['class'] == 'dolphin']
for it in range(5):
    i = 0; i2 = 0; trial = 0
    while i < ROW * COL:
        trial += 1
        if trial > 50:
            break
        k = np.random.randint(0, X_DIV)
        j = np.random.randint(0, Y_DIV)
        xa_mx = k * x_step + x_min
        xb_mx = (k + 1) * x_step + x_min
        ya_mx = j * y_step + y_min
        yb_mx = (j + 1) * y_step + y_min
        df1 = w_df.loc[(w_df.x > xa_mx) & (w_df.x < xb_mx) & (w_df.y > ya_mx) & (w_df.y < yb_mx)]
        df2 = d_df.loc[(d_df.x > xa_mx) & (d_df.x < xb_mx) & (d_df.y > ya_mx) & (d_df.y < yb_mx)]
        i = len(df1)
        i2 = len(df2)
    print(f'### RANDOM: {it}')
    print('>>TSNE:')
    plot_tsne(w_df, d_df, labels=['Whale', 'Dolphin'], colors=['red', 'green'])
    print('>>Whale:')
    if i >= ROW * COL:
        plot(df1, ROW, COL)
    else:
        print('Not Found')
    print('>>Dolphin:')
    if i2 >= ROW * COL:
        plot(df2, ROW, COL)
    else:
        print('Not Found')
    print('\n\n')
### RANDOM: 0 >>TSNE:
>>Whale:
>>Dolphin:
### RANDOM: 1 >>TSNE:
>>Whale:
>>Dolphin:
### RANDOM: 2 >>TSNE:
>>Whale:
>>Dolphin:
### RANDOM: 3 >>TSNE:
>>Whale:
>>Dolphin:
### RANDOM: 4 >>TSNE:
>>Whale:
>>Dolphin:
Let's look at the species of Whales in T-SNE.
plt.figure(figsize=(20, 10))
n_species = w_df.species.nunique()
colors = gen_colors(n=n_species)

plt.subplot(1, 2, 1)
for i, species in enumerate(w_df.species.unique()):
    s_df = w_df.query("species==@species")
    color = '#%02x%02x%02x' % tuple(int(c) for c in colors[i])
    plt.scatter(s_df.x, s_df.y, s=10, color=color, label=species)
plt.title('T-SNE')
plt.legend(prop={'size': 10})

plt.subplot(1, 2, 2)
for i, species in enumerate(w_df.species.unique()):
    s_df = w_df.query("species==@species")
    color = '#%02x%02x%02x' % tuple(int(c) for c in colors[i])
    plt.scatter(s_df.x2, s_df.y2, s=10, color=color, label=species)
plt.title('T-SNE (Finetune)')
plt.legend(prop={'size': 10})

plt.tight_layout()
plt.show()
Let's look at the species of Dolphin in T-SNE.
plt.figure(figsize=(20, 10))
n_species = d_df.species.nunique()
colors = gen_colors(n=n_species)

plt.subplot(1, 2, 1)
for i, species in enumerate(d_df.species.unique()):
    s_df = d_df.query("species==@species")
    color = '#%02x%02x%02x' % tuple(int(c) for c in colors[i])
    plt.scatter(s_df.x, s_df.y, s=10, color=color, label=species)
plt.title('T-SNE')
plt.legend(prop={'size': 10})

plt.subplot(1, 2, 2)
for i, species in enumerate(d_df.species.unique()):
    s_df = d_df.query("species==@species")
    color = '#%02x%02x%02x' % tuple(int(c) for c in colors[i])
    plt.scatter(s_df.x2, s_df.y2, s=10, color=color, label=species)
plt.title('T-SNE (Finetune)')
plt.legend(prop={'size': 10})

plt.tight_layout()
plt.show()
Some closing observations:

- Even though new_individual never appears in the train data, the test embeddings have a similar distribution to the train embeddings.
- Only part of the test data is new_individual; hence, their distribution wasn't very visible in the T-SNE plot.
- killer, southern and pilot whales dominate over the other whale species.
- bottlenose and dusky dolphins dominate over the other dolphin species.

# clean up local wandb files
!rm -rf ./wandb